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ABSTRACT 


In the pursuit of efficient automated content creation, procedural generation, lever- 
aging modifiable parameters and rule-based systems, emerges as a promising ap- 
proach. Nonetheless, it could be a demanding endeavor, given its intricate na- 
ture necessitating a deep understanding of rules, algorithms, and parameters. To 
reduce workload, we introduce 3D-GPT, a framework utilizing large language 
models (LLMs) for instruction-driven 3D modeling. 3D-GPT positions LLMs as 
proficient problem solvers, dissecting the procedural 3D modeling tasks into ac- 
cessible segments and appointing the apt agent for each task. 3D-GPT integrates 
three core agents: the task dispatch agent, the conceptualization agent, and the 
modeling agent. They collaboratively achieve two objectives. First, it enhances 
concise initial scene descriptions, evolving them into detailed forms while dynam- 
ically adapting the text based on subsequent instructions. Second, it integrates 
procedural generation, extracting parameter values from enriched text to effort- 
lessly interface with 3D software for asset creation. Our empirical investigations 
confirm that 3D-GPT not only interprets and executes instructions, delivering reli- 
able results but also collaborates effectively with human designers. Furthermore, 
it seamlessly integrates with Blender, unlocking expanded manipulation possibil- 
ities. Our work highlights the potential of LLMs in 3D modeling, offering a basic 
framework for future advancements in scene generation and animation. 


1 INTRODUCTION 


In the metaverse era, 3D content creation serves as a catalyst for transformative progress, redefining 
multimedia experiences in domains like gaming, virtual reality, and cinema with intricately crafted 
models. Yet, designers often grapple with a time-intensive 3D modeling process, starting from 
basic shapes (e.g., cubes, spheres, or cylinders) and employing software like Blender for metic- 
ulous shaping, detailing, and texturing. This demanding workflow concludes with rendering and 
post-processing to deliver the polished final model. While procedural generation holds promise 
with its efficiency in automating content creation through adjustable parameters and rule-based sys- 
tems 
[2023), it demands a comprehensive grasp of generation rules, algorithmic frameworks, and individ- 
ual parameters. Furthermore, aligning these processes with the creative visions of clients, through 
effective communication, adds another layer of complexity. This underscores the importance of 
simplifying the traditional 3D modeling workflow to empower creators in the metaverse era. 


LLMs have showcased exceptional language understanding capabilities, including planning and tool 
utilization (Imani et al.|[2023} Zhang et al.) 2023a) Gong et al-| 2023} Zeng etal. 2022). 

Furthermore, LLMs demonstrate outstanding proficiency in characterizing object attributes, such 
as structure and texture 2023), enabling 
them to enhance details from rough descriptions. Additionally, they excel at parsing concise tex- 
tual information and comprehending intricate code functions, while seamlessly facilitating efficient 
interactions with users. Driven by these extraordinary capabilities, we embark on exploring their 
innovative applications in procedural 3D modeling. Our primary objective is to harness the power 
of LLMs to exert control over 3D creation software in accordance with the requirements of clients. 
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In pursuit of this vision, we introduce 3D-GPT, a framework aimed at facilitating instruction-driven 
3D content synthesis. 3D-GPT enables LLMs to function as problem-solving agents, breaking down 
the 3D modeling task into smaller, manageable components, and determining when, where, and 
how to accomplish each segment. 3DGPT comprises three key agents: conceptualization agent, 3D 
modeling agent and task dispatch agent. The first two agents collaborate harmoniously to fulfill 
the roles of 3D conceptualization and 3D modeling by manipulating the 3D generation functions. 
Subsequently, the third agent manages the system by taking the initial text input, handling sub- 
sequence instructions, and facilitating effective cooperation between the two aforementioned agents. 


By doing so, they work toward two key objectives. First, it enhances initial scene descriptions, 
guiding them towards more detailed and contextually relevant forms while adapting the textual input 
based on subsequent instructions. Second, instead of directly crafting every element of 3D content, 
we employ procedural generation, making use of adaptable parameters and rule-based systems to 
interface with 3D software. Our 3D-GPT is equipped with the capability to understand procedural 
generation functions and extract corresponding parameter values from the enriched text. 


3D-GPT offers controllable and precise 3D generation guided by users’ textual descriptions. It 
reduces the workload of manually defining each controllable parameter in procedural generation, 
particularly within complex scenes that encompass diverse aspects. Moreover, 3D-GPT enhances 
collaboration with users, making the creative process more efficient and user-centric. Furthermore, 
3D-GPT seamlessly interfaces with Blender, granting users diverse manipulation capabilities: object 
transformations, material adjustments, primitive additions, object animations, mesh editing, and 
physical motion simulations. Based on our experiments, we posit that LLMs exhibit the potential to 
handle more intricate visual inputs. Our contributions are summarized as follows: 


e Introducing 3D-GPT, a training-for-free framework designed for 3D scene generation. Our 
approach leverages the innate multimodal reasoning capabilities of LLMs, streamlining the 
efficiency of end-users engaged in procedural 3D modeling. 

e Exploration of an alternative path in text-to-3D generation, wherein our 3D-GPT generates 
Python codes to control 3D software, potentially offering increased flexibility for real- 
world applications. 

e Empirical experiments demonstrate the substantial potential of LLMs in terms of their rea- 
soning, planning, and tool-using capabilities in 3D content generation. 


2 RELATED WORK 


2.1 TEXT-TO-3D GENERATION 


text-to-3D generation 


The common paradigm of them is to perform per-shape optimization with differentiable rendering 


and the guidance of the CLIP model (Radford et al.|/2021) or 2D diffusion models (Rombach et al. 
2022). For example, DreamFields (Jain et al.||2022) and CLIP-Mesh (Mohammad Khalid et al. 
2022) explore zero-shot 3D content creation using only CLIP guidance. Dreamfusion (Poole et al. 


2022) optimizes NeRF|Mildenhall et al.|(2021) with the guidance of a text-to-image diffusion model, 


achieving remarkable text-to-3D synthesis results. To address optimization speed and visual quality 
challenges, Magic3D uses low-resolution diffusion priors and a sparse 3D hash 
grid for speed, alongside an efficient differentiable render for textured 3D mesh model optimiza- 
tion. To enhance the fidelity of generated 3D models, innovative approaches refine the 3D geometry. 
For instance, Dream3D directly initializes NeRF using a generated Signed Dis- 
tance Function (SDF) for better geometry control. Latent-NeRF incorporates 
a user-provided mesh for direct occupancy loss during geometry optimization in NeRF. Subject- 
driven text-to-3D generation is gaining traction for personalized synthesis (Raj et al.| et al. [2023} Liu 
[et al.| /2023}/Melas-Kyriazi et al.|[2023). It creates subject-specific 3D assets based on p Gr Ea 


and text prompts. To achieve this, DreamBooth3D (Raj et al.||2023) proposes a 3-stage optimization 
strategy to jointly leverage the 3D consistency of NeRF together with the personalizing capability 


of the text-to-image diffusion model. One-2-3-45 2023) uses a view-conditioned 2D dif- 
fusion model (Zero123) to generate multi-view images for learning SDF-based generalizable neural 
surface reconstruction. Unlike the above approaches, our objective is not to generate conventional 
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neural representations as the final 3D output. Instead, we utilize LLMs to generate Python code that 
controls Blender’s 3D modeling based on the provided instructions. 


2.2 LARGE LANGUAGE MODELS 


Large language models (LLMs) are a promising approach to capture and represent the compressed 
knowledge and experiences of humans, projecting them into language space (Devlin et al.| 
2023). LLMs have con- 
sistently showcased remarkable performance extending beyond canonical language processing do- 
mains. They exhibit the capability to address intricate tasks that were once considered the exclusive 
domain of specialized algorithms or human experts. These tasks encompass areas such as mathemat- 
ical eet ee 2023} 2022), medicine (Jeblick et al.) pa ae 
and planning (Zhang et al. 2023} 2023}[2022) For instance, Huang 
(2022) leverage the LLMs’ internet-scale domain knowledge and emergent zero-shot planning 
abilities to perform complex task planning and reasoning. (2023) explore LLMs in multi- 
agent coordination in scenarios encompassing multiple task objectives. (Zeng et al.| introduce 
a modular framework that leverages structured dialogue via prompting between multiple large pre- 
trained models to make joint predictions for new multimodal tasks, without requiring finetuning. 
Moreover, specialized LLMs for particular applications have been explored such as Codex (Chen| 
for Python code generation, Galactica for scientific knowledge, 
and LaMDA (Thoppilan et al. for dialogue applications. This work explores the innovative 
application of LLMs in 3D modeling, employing them to control 3D procedural generation. 


3 3D-GPT 


3.1 TASK FORMULATION 


The overall objective is the generation of 3D content based on a sequence of natural language in- 
structions, denoted as £ = [L,]. The initial instruction, designated as Lo, serves as a comprehensive 
description of the 3D scene, such as “A misty spring morning, where dew-kissed flowers dot a lush 
meadow surrounded by budding trees”. Subsequent instructions are employed to modify the exist- 
ing scene, as exemplified by instructions like “Transform the white flowers into yellow flower” or 
“translate the scene into a winter setting”. 


To accomplish this objective, we introduce a framework named 3D-GPT, which empowers LLMs to 
act as problem-solving agents. We point out that employing LLMs to directly create every element 
of 3D content poses significant challenges. LLMs lack specific pre-training data for proficient 3D 
modeling and, as a result, may struggle to accurately determine which elements to modify and how 
to modify them based on given instructions. 


To avoid this challenge, we employ procedural generation to control the 3D content creation. It 
makes use of adaptable parameters and rule-based systems to interface with 3D software (e.g., 


Blender) so as to efficiently conduct 3D modeling (Deitke et al.|/2022} [Greff et al.|/2022 
2021} Jiang et al) 2018 2023). 


Our approach conceptualizes the 3D procedural generation engine as a set of functions, denoted as 
F = {F;}, where each function F; takes parameters P; as input. 


Within the 3D-GPT framework, for each instruction L;, we frame the modeling task as the selection 
of a subset F C F, combined with the inference of corresponding parameters P; for each func- 


tion F} in this subset. The ultimate aim is to ensure that the F collaboratively generates a 3D scene 
that aligns with the descriptions provided in £. By adeptly addressing both function selection and 
parameter inference for every sub-instruction L;, 3D-GPT generates a Python script file with the 
capacity to manipulate Blender’s 3D modeling environment, thereby proficiently meeting the user’s 
specified requirements outlined in the instruction sequence £. 


3.2 MODELING TOOL PREPARATION 


In our framework, we utilize Infinigen (2023), a Python-Blender-based procedural 
generator equipped with a rich library of generation functions. To empower LLMs with the ability 
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to proficiently leverage Infinigen, we provide crucial prompts for each function F}. These prompts 
encompass function documentation D}, easily understandable code Cj, required information J;, and 
a usage example Fj: 


e D}: it entails a comprehensive explanation of the input parameter P}, coupled with a clear 
elucidation of the function’s purpose and functionality. 

e Cj: we present meticulously restructured and highly readable function code, ensuring that 
it is accessible and comprehensible for LLMs. 

e J;: this component outlines the specific information required to infer the function param- 
eters, thereby assisting LLMs in understanding the context and prerequisites of each func- 
tion. For example, in the case of a flower generation function, J; indicates the required 
visual properties for modeling, such as flower color, flower petal appearance (e.g., size, 
curve, and length), and flower center appearance. 

e Ej: we provide illustrative examples to demonstrate how to infer the parameter P; from 
the accompanying text descriptions and subsequently invoke the function. Continuing with 
the example of a flower generation function, FE; includes a practical demonstration of how 
to infer the parameters and call the function based on input text like “a sunflower.” 


By providing LLMs with these resources, we enable them to leverage their core competencies in 
planning, reasoning, and tool utilization. As a result, LLMs can effectively harness Infinigen for 3D 
generation based on language instructions in a seamless and efficient manner. 


3.3 MULTI-AGENTS FOR 3D REASONING, PLANING AND TOOL USING 


Upon tool preparation, 3D-GPT employs a multi-agent system to tackle the procedural 3D model- 
ing task. This system comprises three integral agents: the task dispatch agent, the conceptualization 
agent, and the modeling agent, illustrated in Figure [I] Together, they deconstruct the procedural 
3D modeling task into manageable segments, with each agent specializing in distinct aspects: 3D 
reasoning, planning, and tool utilization. The task dispatch agent plays a pivotal role in the plan- 
ning process. It leverages user instructions to query function documents and subsequently selects 
the requisite functions for execution. Once functions are selected, the conceptualization agent en- 
gages in reasoning to enrich the user-provided text description. Building upon this, the modeling 
agent deduces the parameters for each selected function and generates Python code scripts to invoke 
Blender’s API, facilitating the creation of the corresponding 3D content. Additionally, images can 
be rendered using Blender rendering capability. 


Task Dispatch Agent for Planing. The Task Dispatch Agent, armed with comprehensive informa- 
tion of all available functions F within the procedural generation, efficiently identifies the requi- 
site functions for each instructional input. For instance, when presented with an instruction such 
as “translate the scene into a winter setting”, it pinpoints functions like add_snow_layer() and 
update_trees(). This pivotal role played by the task dispatch agent is instrumental in facilitating 
efficient task coordination between the conceptualization and modeling agents. Without it, the con- 
ceptualization and the modeling agents have to analyze all provided functions F for each given 
instruction. This not only increases the workload for these agents but also extends processing time 
and can potentially lead to undesired modifications. 


The communication flow between the LLM system, the user, and the task dispatch agent is outlined 
as follows: 


— LLM System: You are a proficient planner for selecting suitable functions based on user 
instructions. You are provided with the following functions: < (Fj, F) >. Below are a few 
examples of how to choose functions based on user instructions: < Eee a 
— User: My instruction is: < L; >. 

— Task Dispatch Agent: Given the instruction < Li >, we determine the sublist of functions F that 
need to be used for 3D modeling. 


In this context, < (Frare, F’*“8°) > represents a list of function names and concise function usage 


descriptions for all available functions and examples < B’@°%*4ispatch > provide guided examples for 
prompt-based instructions. 
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Figure 1: 3D-GPT Overview. 3D-GPT employs LLMs as a multi-agent system with three collab- 
orative agents for procedural 3D generation. These agents consult documents from the procedural 
generator, infer function parameters, and produce Python code. The generated code script interfaces 
with Blender’s API for 3D content creation and rendering. 


Engaging the Conceptualization Agent for Reasoning. The description may not explicitly provide 
the detailed appearance descriptions needed for modeling. For instance, consider the description: “A 
misty spring morning, where dew-kissed flowers dot a lush meadow surrounded by budding trees”. 
When employing a tree modeling function that necessitates parameters such as tree branch length, 
tree size, and leaf type, it becomes apparent that these specific details are not directly stated in 
the given text. When instructing the modeling agent to infer parameters directly, it tends to provide 
simplistic solutions, like using default or reasonable values from the parameter document or copying 
values from prompting examples. This reduces diversity in generation and complicates parameter 
inference. 


To alleviate this issue, we introduce the conceptualization agent which collaborates with the task 
dispatch agent to augment the user-provided text description (L;). After the task dispatch agent 
selects the required functions, we send the user input text and the corresponding function-specific 
information to the conceptualization agent and request augmented text. For each function F}, it 


enriches L; into detailed appearance descriptions L7. The communication between the system and 
the Conceptualization Agent for instruction < L; > and function < F} > is as follows: 


— LLM System: You are a skilled writer, especially when it comes to describing the appearance 
of objects and large scenes. Given a description < L; >, provide detailed descriptions for the 
following information < I; >. For terms not mentioned in the description, use your imagination 
to ensure they fit the text description. 

— Conceptualization Agent: Given the < Li > and requested information < I; >, the extended 


description is: < Lİ > 


Modeling Agent for Tool Using. After conceptualization, the 3D modeling processing is targeted 
to convert the detailed human language to machine-understandable language. 


In our framework, our modeling agent manipulates the functions of procedural modeling in the 
library to create a realistic 3D model. For each function F; and user instruction L;, the task dispatch 


agent receive augmented context Er from the conceptualization agent. For each function Fj, we 
have the code C}, function documentation Dj, and one usage example Æj. The modeling agent 
utilizes this information to select the appropriate functions and deduce the corresponding parameters. 
Subsequently, the modeling agent generates Python code that accurately calls the selected function 
(e.g., call it in the loop, not call) and correctly passes the inferred parameters with the appropriate 
data types to the function. 


The two-turn utterances are based on the following pattern: 
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“A vibrant autumn forest, with trees ablaze in “A misty spring morning, where dew- “A serene winter landscape, with snow- 
shades of red, orange, and gold, as a gentle kissed flowers dot a lush meadow covered evergreen trees and a frozen lake 
breeze rustles the fallen leaves.” surrounded by budding trees.” reflecting the pale sunlight.” 


+ Ss 


Wterin~ 


er 


“The mountains, majestic and snow-capped, “The desert, an endless sea of shifting sands, “The lake, serene and glassy, mirrored the 


stood like sentinels guarding the vast stretched to the horizon, its rippling dunes cloudless sky above, reflecting the 
expanse of the valley, their peaks catching the golden rays of the setting sun, surrounding mountains and the graceful 
disappearing into the swirling mist that creating an ever-changing landscape of flight of a heron, as lily pads floated like 
clung to their rugged slopes.” shadows and light.” emerald jewels upon its tranquil surface." 
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Figure 2: Visual Examples of Instruction-Based 3D Scene Generation. 3D-GPT can construct 


large 3D scenes that align with the provided initial instruction. We demonstrate that the rendered 
images contain various visual factors in line with the given instructions. 


— LLM System: You are a good 3D designer who can convert long text descriptions into pa- 
rameters, and is good at understanding Python functions to manipulate 3D content. Given the text 
description < Lf >, we have the following function codes < C; > and the document for function < Dj; >. 
Below is an example bout how to make function calls to model the scene to fit the description: < id >. 
Understand the function, and model the 3D scene that fits the text description by making a function call. 


— Modeling Agent: Given the description < Li >, we use the following functions: ..., and their 
respective parameter values ... are adopted. 


Blender Rendering. The Modeling agent ultimately supplies the Python function calls with in- 
ferred parameters, which are employed for Blender node control and rendering, resulting in the 
production of the final 3D mesh and RGB results. 


4 EXPERIMENTS 


Our experimentation begins by showcasing the proficiency of 3D-GPT in consistently generating 
results that align with user instructions, encompassing scenarios involving both large scenes and 
individual objects. Subsequently, we delve into specific examples to illustrate how our agents ef- 
fectively comprehend tool functionalities, access necessary knowledge, and employ it for precise 
control. To deepen our understanding, we conduct an ablation study to systematically examine the 
contributions of each agent within our multi-agent system. 


4.1 3D MODELING 


Large Scene Generation. We investigate the capability of 3DGPT to control modeling tools based 
on scene descriptions without any training. To conduct this experiment, we generated 100 scene 
descriptions using ChatGPT with the following prompt: “You are a good writer, provide 10 different 
natural scene descriptions for me”. We collected responses to this prompt 10 times to form our 
dataset. In Figure[2| we present the multi-view rendering results of 3D-GPT. These results indicate 
that our approach is capable of generating large 3D scenes that generally align well with the provided 
text descriptions, showcasing a noticeable degree of diversity. Notably, all 3D outcomes are directly 
rendered using Blender, ensuring that all meshes are authentic, thereby enabling our method to 
achieve absolute 3D consistency and produce real ray-tracing rendering results. 


Fine-detail Control for Single Class. Apart from generating large scenes from concise descrip- 
tions, we assess the capabilities of 3D-GPT for modeling objects. We evaluate crucial factors such 
as curve modeling, shape control, and an in-depth understanding of object appearances. To this end, 
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Figure 3: Single Class Control Result. Our method effectively acquires the necessary knowledge 
for modeling, enabling precise object control in terms of shape, curve, and key appearance capture. 
The generated results closely align with the given text. 


(a). “A misty spring morning, where dew-kissed flowers. (b). “Transform the white flowers into (c). “Eliminate the green trees and decrease the 
dot a lush meadow surrounded by budding trees.” yellow flowers. ” number of yellow flower trees.” 
(d). “Create a cloudy weather.” (e). “Place trees on the mountain.” (f). “Translate the scene into a winter setting.” 


Figure 4: Subsequence Instruction Editing Result. (a) Initial instruction-generated scene. (b)-(f) 
Sequential editing steps with corresponding instructions. Our method enables controllable editing 
and effective user-agent communication. 


we report the results of fine-grained object control. This includes nuanced aspects such as object 
curves, key appearance features, and color, all derived from input text descriptions. We employ 
random prompts to instruct GPT for various real-world flower types. As depicted in Figure J] our 
method adeptly models each flower type, faithfully capturing their distinct appearances. This study 
underscores the potential of 3D-GPT in achieving precise object modeling and fine-grained attribute 
control of object types and visual characteristics. 


Subsequence Instruction Editing. Here, we test the ability of 3D-GPT for effective human-agent 
communication and task manipulation. In Figure |4| we observe that our method can comprehend 
subsequence instructions and make accurate decisions for scene modification. Note that, unlike 
the existing text-to-3D methods, 3D-GPT maintains a memory of all prior modifications, thereby 
facilitating the connection of new instructions with the scene’s context. Furthermore, our method 
eliminates the need for additional networks for controllable editings{Zhang et al.|(2023b). This study 
underscores the efficiency and versatility of 3D-GPT in adeptly handling complex subsequence in- 
structions for 3D modeling. 


Individual Function Control. To evaluate the effectiveness of 3D-GPT in tool utilization, we 
present an illustrative example that highlights our method’s ability to control individual functions 
and infer parameters. Figure |5| exemplifies the capability of 3D-GPT to model sky appearances 
based on input text descriptions. It is worth noting that the function responsible for generating the 
sky texture does not directly correlate color information with sky appearance. Instead, it relies on 
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(a). “The sun shines brightly in a clear blue sky. (b). “Thick gray clouds cover the entire sky, (c). “The sky appears hazy, with a soft, 
Not a cloud in sight.” diffusing the sunlight.” diffuse sunlight filtering through.” 


(d). “The sky is painted in shades of orange (e). “The sky is a mixture of soft purples (f). “A canopy of stars fills the dark night 
and pink as the sun sets behind the island.” and pinks as the sun rises.” sky, with the island bathed in moonlight.” 


(a). Solution: From the description of "clear blue sky" and "not a cloud in sight", we can infer that the sun_intensity should be high, the sun_elevation 
should be around 90 (daytime), the sun_rotation can be set to 0. The air_density and dust_density should be low, so we can set them to 1 and 0 
respectively. Since there are no clouds, the cloud_density should be 0. To make the sky appear bluer, we can increase the ozone value, so let's set it to 2. 
We can model the sky by calling the following function: 


Dpython 
sky_texture_node(sun_intensity=high', sun_elevation=90, sun_rotation=0, air_density=1, dust_density=0, ozone=2, cloud_density=0) 


Figure 5: Single Function Control Result. Visual result (top) and modeling agent response exam- 
ple (bottom). Our method demonstrates a high degree of accuracy in inferring algorithm parameters, 
even when they do not possess a direct connection to visual appearance. 


the Nishita-sky modeling method, which requires a profound understanding of real-world sky and 
weather conditions, considering input parameters. Our method adeptly extracts crucial information 
from the textual input and comprehends how each parameter influences the resulting sky appearance, 
as evident in Figure [5] (c) and (d). These results demonstrate that our method can effectively use 
individual functions as well as infer corresponding parameters. 


4.2 ABLATION STUDY 


We conduct separate ablation studies for the Conceptualization Agent and Task Dispatch Agent. 
Our assessment focused on CLIP scores (Radford et al.|/2021), failure rates, and parameter diversity, 
quantified using the categorical Shannon Diversity Index. The CLIP score measures the alignment 
between text and generated images. The failure rate represents the percentage of system failures 
due to issues such as incorrect datatypes, wrong response patterns, or missing parameters from the 
Modeling Agent. Parameter diversity aims to gauge the diversity of generated outputs. 


Metrics/ | CLIP Metrics/ | CLIP Failure Parameter 
Method Score Method | Score Rate Diversity 
w/o TDA | 22.79 w/o CA | 21.51 3.6% 6.32 
Ours 29.16 Ours 30.30 0.8% 7.34 

(a) Ablation Study of Task Dispatch Agent. (b) Ablation Study of Conceptualization Agent. 


Table 1: Ablation Study. “w/o CA” indicates without the Conceptualization Agent, “w/o TDA” 
indicates without the Task Dispatch Agent. 


Case Study of Task Dispatch Agent. For the Task Dispatch Agent, the CLIP score is measured 
using 100 initial scene descriptions, each appended with one additional subsequence instruction for 
each scene. Table [I] (a) shows that without the Task Dispatch Agent, the CLIP score dropped from 
29.16 to 22.79. It is important to note that the Task Dispatch Agent primarily impacts the perfor- 
mance of subsequence instructions, as all functions are utilized for the initial instruction. These 
findings underscore the pivotal role of the Task Dispatch Agent in managing communication flow. 
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“Carnation” 
i 


The carnation is a charming flower known for its 
delicate details. Its center size typically ranges 


from 1 to 2 cm in diameter. The petals can vary in 


length, averaging around 3-5 cm, and they 
gracefully encircle the center. The petal width 
measures approximately 1-2 cm, and it's usually 
about half the length of the petals. The petals 


exhibit a soft, rounded shape, lending the flower an 
inviting openness. Carnation petals are known for 


their unique and delicate fringed edges. These 


“Gerbera” 


The Gerbera, renowned for its striking 
beauty, boasts a center size of 
approximately 3-5 cm. Its vibrant petals, 
often measuring 7-12 cm, are notably larger 
than the center. Petals exhibit a width of 
about 1-2 cm, nearly half their length, 
showcasing an elongated shape. They 


possess a gentle curvature, enhancing their 
roundness. The Gerbera's open bloom 
creates a captivating display, revealing a 


“Dandelion” 
4 


The dandelion, a common wildflower, 
showcases unique characteristics. Its center, 
typically 1-2 cm in diameter, anchors a 
profusion of bright yellow petals. Each 
petal averages 2-3 cm in length, about 
twice the size of the center. Dandelion 
petals are quite narrow, usually measuring 
around 0.1 to 0.2 cm in width. Dandelion 
petals display a gentle curvature, 
contributing to the flower's spherical form. 


PER | Singes can give the petals a slightly wrinkled or SOE. Its They often appear slightly wrinkled along 
(pif | fed appearance, adding to their charm. . petals showcase a spectrum of colors, from the edges, adding a touch of texture. Their 
7 Carnations come in a wide spectrum of colors, bold reds to soft pinks, and they are vibrant yellow hue and densely clustered 
from vibrant reds to pastel pinks, and their petals densely packed, resulting in lush, eye- petals create a stunning floral carpet in 
oe are densely packed, creating lush, full blooms. catching floral arrangements. open fields. 


Figure 6: Conceptualization Agent Case Study. The enriched textual evidence demonstrates that 
the Conceptualization Agent provides essential knowledge for parameter inference (highlighted in 
green). For each subfigure, we compare the 3D model without (Top) and with (Bottom) agent. The 
models generated with the agent better match the text description than those without it. 


Case Study of conceptualization Agent. For the Conceptualization Agent, the CLIP score is mea- 
sured using 100 initial scene descriptions. Table[I](b) displays the results, indicating that without the 
Conceptualization Agent, both text alignments (CLIP score) and parameter diversity decreased sig- 
nificantly. Moreover, the failure rate increased substantially, which adversely impacts the efficiency 
of the entire modeling process. Figure [6jillustrates how the Conceptualization Agent facilitates the 
acquisition of essential knowledge for 3D modeling, providing a visual comparison of results with 
and without its involvement. When the Conceptualization Agent is engaged, the generated results 
closely align with the appearance of the intended flower type, highlighting its invaluable contribution 
to elevating overall 3D generation quality and fidelity. 


5 DISCUSSION AND CONCLUSION 


We have introduced 3D-GPT, a novel training-free framework for instruction-driven 3D modeling 
seamlessly integrated with procedural generation. Leveraging the capabilities of LLMs, 3DGPT 
aims to enhance human-AI communication in the context of 3D design. Our approach involves the 
collaborative efforts of three agents functioning as a cohesive 3D modeling team, ultimately yielding 
a 3D modeling file as output, as opposed to conventional 3D neural representations. Moreover, 
our method consistently delivers high-quality results, showcases adaptability to expansive scenes, 
ensures 3D consistency, provides material modeling and editing capabilities, and facilitates real ray 
tracing for achieving lifelike visualizations. Our empirical experiments show the potential of LLMs 
for reasoning, planning, and tool using in procedural 3D modeling. 


Limitations and Potential Directions. While our framework has demonstrated promising 3D 
modeling results closely aligned with user instructions, it is essential to acknowledge several 
limitations: 1) Limited curve control and shading design: Currently, our framework lacks advanced 
capabilities for precise curve control and intricate shading design. Tasks involving the manipulation 
of tree branches or the blending of colors for leaf textures remain challenging. 2) Dependence 
on procedural generation algorithms: the effectiveness of our framework is contingent on the 
quality and availability of procedural generation algorithms. This reliance may limit results in 
specific categories, such as hair and fur. 3) Handling multi-modal instructions: challenges arise 
in processing multi-modal instructions, including audio and image inputs, potentially leading 
to information loss. These limitations offer valuable insights for shaping future research and 
development in the field. We highlight three compelling directions for future investigation: 

LLM 3D Fine-Tuning: It is promising to fine-tune LLMs to enhance their capabilities in geometry 
control, shading design, and fine-texture modeling. This refinement will make LLMs more adept at 
handling intricate 3D modeling tasks and grant greater creative control over the resulting 3D scenes. 
Autonomous Rule Discovery: Building on the demonstrated tool-making capabilities of LLMs, 
one direction is to develop an autonomous 3D modeling system that reduces human involvement. 
This could empower LLMs to autonomously discover generation rules for new object classes and 
scenes, thus expanding the creative potential. 

Multi-Modal Instruction Processing: To achieve more comprehensive and expressive 3D modeling 
based on varied user inputs, it is crucial to enhance the system’s ability to comprehend and respond 
to multi-modal instructions. This would facilitate richer and more diverse 3D modeling outcomes, 
shaped by a broader spectrum of user inputs. 
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6 APPENDIX 


6.1 ADDITIONAL RESULT 


We kindly request the reader to consider visiting |https://chuny |.github.io/3 DGPT/3dgpt.html| to 


view our high-quality 3D results. 


6.2 PROMPT EXAMPLE 


We offer an illustrative example of the prompt used for Adding Trees, presenting the Document, 
Code, Information, and a Usage Example. 


Document: 
Explanation: The function takes a built natural scene as input and adds base trees to it. 
Inputs: 
scene: The built natural scene. 
density(float): The density of the trees. 
distance_min(float): Minimum distance between trees. 
leaf_type(string): The type of leaf on the tree. Select one from the list [‘leaf’, ‘leaf_broadleaf’, ‘leaf_ginko’, ‘leaf_maple’, ‘flower’, ‘None’]. 
‘Leaf’ allows further custom settings like changing the leaf shape and color. ‘Leaf_broadleaf’, ‘leaf_ginko’, and ‘leaf_maple’ build the leaves with 
predefined shapes: broad, ginko, and maple respectively. ‘Flower’ will create flowers on the tree instead of leaves. ‘None’ will not generate leaves on 
the tree. 
fruit_type(string): The type of fruit on the tree. Select one from the list [‘apple’, ‘blackberry’, ‘coconut_green’, ‘durian’, ‘starfruit’, ‘strawberry’, 
‘custom _fruit’, ‘None’]. Each of the fruit types will create the corresponding fruit on the tree. ‘Custom_fruit’ can create customizable fruit on the tree 
with further adjustments. If the desired fruit is not on the list, ‘custom_fruit’ should be chosen. ‘None’ will not generate any fruit on the tree. 


Code: 
python 
import TreeFactory 
def add_trees(scene, density, distance_min, leaf_type, fruit_type, leaf_density): 
trees = TreeFactory.create(density, leaf_type, fruit_type, leaf_density) # create trees 
scene.placement(trees, distance_min) # place trees in the scene 
return 


Information: 
"Possible trees in the scene, tree types, leaf density, possible seasons, tree size, appearance of the tree branch, branch type, and leaf color." 


Example: 

Question: Given the text description of the scene, analyze the function parameters and call the function to add more trees that fit the text 
description. 

Solution: From the text description, three different types of trees are identified in the scene: birch, dogwood, and Japanese maple. From “a cluster 
of,” we infer the tree density is relatively high; let’s set it to 0.2. Birch tree leaves are typically oval or triangular, so let’s set the leaf type to ‘leaf’ to 
allow further customization. There are no fruits mentioned for the birch tree in the description. Given the scene likely depicts autumn, the leaf_density 
should not be too high; let’s set it to 0.02. For the dogwood tree, as the leaves are described as oval or elliptical with smooth edges, the leaf type is set 
to ‘leaf’. It is mentioned that “clusters of small, vibrant berries cling to its branches,” so we can choose the fruit_type as ‘blackberry’. The 
leaf_density is set to 0.02. For the Japanese maple tree, the leaf type is set to ‘leaf_maple’. No fruits are mentioned for the maple tree. As the scene 
likely takes place in autumn, the leaf density for the maple tree should be high; let’s set it to 0.1. The function can be called three times to create three 
different trees: 


“python 

add_trees(scene, density=0.2, distance_min=1, leaf_type='leaf’, fruit_type='None'’, leaf_density=0.02) # create birch trees 
add_trees(scene, density=0.1, distance_min=1, leaf_type='leaf’, fruit_type='blackberry', leaf_density=0.02) # create dogwood trees 
add_trees(scene, density=0.1, distance_min=1, leaf_type='leaf_maple', fruit_type='None', leaf_density=0.1) # create Japanese maple trees 


Figure 7: Prompt Example of Adding Trees. 
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Document: 
Input: 
sun_intensity: Multiplier for sun disc lighting. (Choose from ‘low’, ‘median’, ‘high’) 
sun_elevation: Rotation of the sun from the horizon (in degrees). (0: sunset, sun rising, 90: daytime, -10: night) 
sun_rotation: Rotation of the sun around the zenith (in degrees). 
air_density: Density of air molecules. (0: no air, 1: clear day atmosphere, 2: highly polluted day) 
dust_density: Density of dust and water droplets. (0: no dust, 1: clear day atmosphere, 5: city-like atmosphere, 10: hazy day) 
ozone: Density of ozone molecules; useful to make the sky appear bluer. (0: no ozone, 1: clear day atmosphere, 2: city-like atmosphere). A 
higher value yields a bluer sky. 
cloud_density: Density of the clouds, ranging from 0 to 0.04. (0.01: very thick cloud, 0.04: very heavy cloud) 
Output: 
texture_color_output: The output texture color of the sky. 


Code: 
mpi 
import nishita_sky_modelling 
def sky_texture_node(sun_intensity, sun_elevation, sun_rotation, air_density, dust_density, ozone, cloud_density): 
texture = nishita_sky_modelling.model_sky(sun_intensity, sun_elevation, sun_rotation, air_density, dust_density, ozone, cloud_density) 
return texture 


Information: 
The time of day depicted in the scene (morning/noon/evening), the possible season, the condition of the air, the blueness of the sky, and the cloud 
density. 


Example: 
Question: 
Given the text description of the scene: “The river, reflecting the clear blue of the sky, glistened like a silver ribbon as it wound its way through 
the lush valley, its tranquil waters whispering secrets to the ancient trees,” analyze the function parameters and call the function to generate the sky. 


Solution: 

From the description of a “clear blue sky,” we infer that sun_intensity cannot be low, let's set it as "median". As the sky is clear and blue, the 
sun_elevation must not be very low, indicating it's not morning or evening; we can set it to 50 degrees. 

Sun rotation does not impact the appearance of the sky significantly; thus, we will set it to 0 degrees. To align with the description of "clear 
blue," we should choose lower values for air_density and dust_density. Hence, we will set air_density to 1 and dust_density to 0. To enhance the 
blueness of the sky, we can increase the ozone value to 2. 

As the description suggests a clear, blue sky, the cloud density must be low. Therefore, we will set it to 0.0005. The sky can be modeled by 
invoking the function as follows: 


“python 


sky_texture_node(sun_intensity='median', sun_elevation=50, sun_rotation=0, air_density=1, dust_density=0, ozone=2, cloud_density=0.0005) 


Figure 8: Prompt Example of Sky Modeling Function. 


6.3 ABLATION STUDY DETAILS 


We conduct separate ablation studies for the Conceptualization Agent and Task Dispatch Agent, 
evaluating their performance based on CLIP scores, failure rates, and parameter diversity. 


We use the implementation of Radford et al.| (2021) to calculate CLIP score. It measures cosine 
similarity in the CLIP hidden space, indicating the alignment between generated images and text 
descriptions. For 3D scene CLIP scores, we placed a camera at the scene’s center, capturing 50 
images by rotating the camera 360 degrees. In the Conceptualization Agent evaluation, we used only 
the initial instruction. For the Task Dispatch Agent, both the initial instruction and one subsequence 
instruction were used. 


The failure rate reflects the Modeling Agent’s response. Failures can occur when the method cannot 
extract the correct pattern via the parser, generates data with an incorrect datatype, or omits/adds 
parameters to the function call. We evaluated five functions for each scene, recording the failure 
rate. We also assessed 500 parameters per scene, employing the Shannon Diversity Index to measure 
parameter diversity, and categorizing float parameters into 100 bins. 


6.4 AGENT RESPONSES AND VISUAL RESULT 
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(a). “The sun shines brightly in a clear blue sky. (b). “Thick gray clouds cover the entire sky, (c). “The sky appears hazy, with a soft, 
Not a cloud in sight.” diffusing the sunlight.” diffuse sunlight filtering through.” 


AIEEE 


(d). “The sky is painted in shades of orange (e). “The sky is a mixture of soft purples (f). “A canopy of stars fills the dark night 
and pink as the sun sets behind the island.” and pinks as the sun rises.” sky, with the island bathed in moonlight.” 


© (a). Solution: From the description of "clear blue sky" and "not a cloud in sight", we can infer that the sun_intensity should be high, the sun_elevation 
should be around 90 (daytime), the sun_rotation can be set to 0. The air_density and dust_density should be low, so we can set them to 1 and 0 
respectively. Since there are no clouds, the cloud_density should be 0. To make the sky appear bluer, we can increase the ozone value, so let's set it to 2. 
We can model the sky by calling the following function: 
‘python 
sky_texture_node(sun_intensity='high’, sun_elevation=90, sun_rotation=0, air_density=1, dust_density=0, ozone=2, cloud_density=0) 


(b). Solution: From the description of "Thick gray clouds cover the entire sky", the sun_intensity should be low to account for the diffused sunlight. Let's 
set it as "low". The sun_elevation should be high to indicate that it is daytime. Let's set it to 90. The sun_rotation does not affect the sky appearance, so we 
can set it to 0. The air_density and dust_density should be high to represent the presence of thick clouds. Let's set air_density to 2 and dust_density to 5. 
Since the sky is covered with thick gray clouds, the ozone value can be low. Let's set it to 0. As mentioned, the entire sky is covered with thick gray clouds, 
so the cloud_density should be high. Let's set it to 0.04. We can model the sky by calling the following function: 

Spy 


sky_texture_node(sun_intensity='low', sun_elevation=90, sun_rotation=0, air_density=2, dust_density=5, ozone=0, cloud_density=0.04) 


(c). Solution: From the description of "hazy sky" and "soft, diffuse sunlight", we can infer that the sun_intensity should be low. Let's set it as "low". The 
sun_elevation should be low as well, indicating a time close to sunrise or sunset. Let's set it to -10. The sun_rotation does not affect the sky appearance, so 
we can set it to 0. To fit the description of a hazy sky, the air_density and dust_density should be higher. Let's set air_density to 2 and dust_density to 5. 
The ozone value can be set to 1 to maintain a natural sky color. Since the sky is hazy, the cloud_density should be higher as well. Let's set it to 0.02 to 
represent a moderately cloudy sky. We can model the sky by calling the following function: 

ppi 


sky_texture_node(sun_intensity='low', sun_elevation=10, sun_rotation=0, air_density=2, dust_density=5, ozone=1, cloud_density=0.02) 


(d). Solution: From the description of "shades of orange and pink", the sun_intensity can be set as "low" to create a softer and warmer lighting. As the sun 
is setting, the sun_elevation can be set to 0. The sun_rotation does not affect the sky appearance, so it can be set to 0. To fit the description of "shades of 
orange and pink", the air_density and dust_density should be low, so we can set air_density to 1 and dust_density to 0. To create a warmer and more 
vibrant sky, we can set the ozone value to 1. As the sun is setting, there might be some clouds in the sky, so we can set the cloud_density to 0.01. We can 
model the sky by calling the following function: 

A 


sky_texture_node(sun_intensity='low', sun_elevation=0, sun_rotation=0, air_density=1, dust_density=0, ozone=1, cloud_density=0.01) 


(e). Solution: From the description of "soft purples and pinks", the sun_intensity can be set to "low" to create a softer lighting effect. Since the sun is 
rising, the sun_elevation can be set to a low value, such as 10. The sun_rotation does not affect the sky appearance, so it can be set to 0. To create a purple 
and pink sky, the air_density and dust_density can be set to low values, such as | and 0. The ozone value can be set to 0 to avoid making the sky bluer. 
Since the sky is described as a mixture of colors, the cloud_density can be set to a low value, such as 0.001. We can model the sky by calling the following 
function: 

‘python 

sky_texture_node(sun_intensity='low', sun_elevation=10, sun_rotation=0, air_density=1, dust_density=0, ozone=0, cloud_density=0.001) 


(f). Solution: From the description of "dark night sky" and "canopy of stars", we can assume that the sun is not visible in the sky. Therefore, the 
sun_intensity can be set to 'low'. Since it is a night scene, the sun_elevation can be set to -10. The sun_rotation does not affect the sky appearance, so it can 
be set to 0. As it is a dark night sky, the air_density and dust_density can be set to 0. The ozone value can also be set to 0 as there is no mention of blueness 
in the description. Since there are no clouds mentioned, the cloud_density can be set to 0. We can model the sky by calling the following function: 
‘python 

sky_texture_node(sun_intensity='low', sun_elevation=-10, sun_rotation=0, air_density=0, dust_density=0, ozone=0, cloud_density=0) 


Figure 9: Single Function Control Result. Visual result (top) and modeling agent response exam- 
ple (bottom). Our method demonstrates a high degree of accuracy in inferring algorithm parameters, 
even when they do not possess a direct connection to visual appearance. 
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